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Abstract 

Array-Based Comparative Genomic Hybridization (aCGH) is a method used to 
search for genomic regions with copy numbers variations. For a given aCGH profile, 
one challenge is to accurately segment it into regions of constant copy number. 
Subjects sharing the same disease status, for example a type of cancer, often have 
aCGH profiles with similar copy number variations, due to duplications and deletions 
relevant to that particular disease. 

We introduce a constrained optimization algorithm that jointly segments aCGH 
profiles of many subjects. It simultaneously penalizes the amount of freedom the 
set of profiles have to jump from one level of constant copy number to another, 
at genomic locations known as breakpoints. We show that breakpoints shared by 
many different profiles tend to be found first by the algorithm, even in the presence 
of significant amounts of noise. 

The algorithm can be formulated as a group LARS problem. We propose an 
extremely fast way to find the solution path, i.e., a sequence of shared breakpoints 
in order of importance. For no extra cost the algorithm smoothes all of the aCGH 
profiles into piecewise-constant regions of equal copy number, giving low-dimensional 
versions of the original data. These can be shown for all profiles on a single graph, 
allowing for intuitive visual interpretation. Simulations and an implementation of 
the algorithm on bladder cancer aCGH profiles are provided. 

1 Introduction 

Array-based Comparative Genomic Hybridization (aCGH) is a technique that aims to 
detect chromosomal aberrations on a genomic scale in a single experiment. In tumors for 
example, chromosomal aberrations include tumor suppressor genes being inactivated by 
deletion, and oncogenes being activated by duplication. Such changes to the underlying 
"normal" genomic state, known as copy number variations (CNVs) provide information 
related to the current disease state of the individual. 

Many cancers are known to exhibit recurrent CNVs in specific locations in the 
genome, for example monoploidy of chromosome 3 in uveal melanoma [20], loss of chro- 
mosome 9 in bladder carcinomas [1], loss of lp and gain of 17q in neuroblastoma [2,24] 
and amplifications of lq, 8q24, llql3, 17q21-q23 and 20ql3 in breast cancers [26]. 

aCGH allows rapid mapping of CNVs of a tumor sample at the genomic level [17]. 
Originally, the technique was based on arrays with several thousand large insert clones 
(e.g., bacterial artificial chromosomes or BACs) with a megabase resolution. More re- 
cently, this has been improved using oligonucleotide-based arrays with up to hundreds 
of thousands of probes, bringing the resolution down to several kilobases [6]. Each spot 
on an aCGH array contains amplified or synthesized DNA of a particular region of the 
genome. The array is hybridized with DNA extracted from a sample of interest, and in 
most cases with (healthy) reference DNA. Each of the two samples is then labeled with 
its own fluorochrome and the ratio of fluorescence between the two samples is expected 
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to reveal the ratio of DNA copy number at each position of the genome. Usually the 
logarithm of this ratio is then taken and the values (profile) plotted linearly along the 
genome, as shown for example in Fig. 0h)-(c). 

One challenge related to aCGH profiles is to detect regions of constant copy number, 
separated by breakpoints, in the presence of significant amounts of noise. Many groups 
have attempted to answer this question when treating a single profile, with various meth- 
ods for segmentation and smoothing of 1-dimensional profiles [5, 8, 9, 13, 15, 16, 23, 25]. 
Recently, approaches have been suggested for dealing with multiple aCGH profiles. In- 
deed, we often have experimental data from a series of patients who have the same 
disease status, or several groups of patients, each with a particular disease status. The 
goal of jointly treating several profiles is to extract breakpoints and copy number varia- 
tions that are globally representative of the disease status of those patients. The STAC 
algorithm [3] uses a novel search strategy to find regions of copy number gains and 
losses that are statistically significant across a set of profiles. It assigns p- values to each 
location on the genome using a multiple testing corrected permutation approach. The 
approach of [12] is based on kernel regression. One disadvantage of this is that results 
are dependent on the width of the kernel window, so a region may exhibit a statisti- 
cally significant copy number variation for one width, but not for another, leading to 
difficulties in interpretation. The approach in [19] is to first discretize profiles indepen- 
dently into gains, losses and normal. Then, they look for regions in which a significant 
number of the profiles share gains or losses using a Boolean framework and theoretical 
results for pattern searching. This method potentially loses pertinent information in the 
discretization step. Finally, the approach in [14] is to simultaneously segment a set of 
profiles using mixed linear models to account for both covariates and correlations be- 
tween probes. They proposed an EM-type algorithm that uses dynamic programming 
during the segmentation step. Though the computational cost was not provided, the 
algorithm appears computationally intensive. 

In this article, we present an algorithm that jointly renders a set of n profiles 
piecewise-constant, with the intended goal of uncovering a set of breakpoints and regions 
of constant copy number that are pertinent with respect to the whole set of profiles. This 
algorithm generalizes a Fused Lasso-type algorithm that was used to find breakpoints 
(and smooth) a single profile in [7]; there, they transformed the problem into a standard 
Lasso [22] and proposed an extremely fast way to find the whole solution path, that 
is, an ordered set of pertinent breakpoints. When several profiles must be segmented 
simultaneously, we propose to impose a fused constraint jointly on the set of profiles, 
leading to a reformulation of the problem as a group LARS or group Lasso [27]. The 
originality of our approach lies in the fact that the fused constraint imposes that each 
newly-selected breakpoint be shared by all profiles. In this formulation, we end up with 
a stupendously large matrix (with p 2 n 2 entries) to work with, but we show that to run 
a group LARS algorithm it is not necessary to store this matrix. 

In order to find k breakpoints in n profiles of p probes, our method has a complexity 
0(knp) in time and requires 0(np) in memory. In speed trials using Matlab on a 2008 
Macbook Pro with 4GB of RAM, for n = 20 profiles of p = 2000 probes, 50 breakpoints 
were found in 0.72 seconds, an average of 0.014 seconds each. Even at the limit of 
current aCGH technology, with p ~ 946 000 and, for example n = 20, the first 50 
selected breakpoints were obtained in 304 seconds, an average of about 6 seconds each. 
We present the performance of the algorithm on simulated data under various noise 
conditions, and demonstrate that it is able to recover the breakpoints shared by some or 
all of the profiles as the number of profiles increases. Finally we implement the algorithm 
to segment aCGH profiles of bladder cancer patients. 
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2 Methods 



Suppose we have n aCGH profiles each of length p. The p probe values are calculated 
at identical locations for each of the n profiles. For each profile, we want to find a 
piecewise constant representation, where the jumps between constant segments represent 
breakpoints, that is, places where the copy number changes. 

2.1 One profile, one chromosome 

Our starting point is the following framework for finding breakpoints in a one- 
dimensional piecewise-constant signal with white noise, as introduced by [7]. Let 
Y = (yi, . . . , i/p) be the observed signal. Consider the following constrained optimization 
problem: 

p 

min \\(3 — Y \\\ subject to \(3i — < /Jt, (1) 

^ 8=1 

where /x is a fixed non-negative constant and by convention 0q = 0. The constraint 
Y2i=i I A ~~ A-i| < /•* can be seen as a convex relaxation of bounding the number of 
jumps in 0, an idea also implemented in the fused Lasso [23]. When fi is small enough, 
this constraint causes the solution j3 to be made up of runs of equally- valued /3$ separated 
by an occasional jump from one constant to another, i.e., a piecewise constant function. 
For ji large enough the constraint is no longer effective and the solution is merely = Y. 

This problem is convex, meaning that any standard convex optimization package 
can solve it for a given /iora sequence {^j}j = i, though this remains computationally 
intensive. Making the change of variable u± = @i,U2 = 02 — Al> ■ ■ ■ >%> = /3 p — p -i, [7] 
rewrite ([T]) as 

p 

min \\Au — Y\fe subject to \u p \ < fj,, 

where A is a p x p matrix whose lower-diagonal and diagonal are 1 and upper-diagonal 
is 0. This is exactly a Lasso regression problem [22], whose complete solution path for 
ji can be obtained for example by the lars package [4] in a matter of seconds if p is of 
the order of several hundred. However, once p gets into the thousands, implementations 
like lars, which require storing a p x p matrix and that have complexity 0(A; 3 +pk 2 ) for 
calculating the first k breakpoints, greatly slow down or do not run. It was shown in [7] 
that the algorithm can be reformulated to take advantage of the structure of the matrix 
A without storing it, resulting in a computation time in 0(pk) with only the need to 
store vectors of length p. 

2.2 Many profiles, one chromosome 

Let us now consider a n x p matrix Y containing n profiles of length p. Our aim is to 
apply a similar procedure to the one profile case, but jointly to the whole set of profiles. 
We propose the following constrained optimization problem: 

p 

min -Y\\l subject to V ||A - A-1II2 < M> ( 2 ) 
/3eM nx p ~ i 

i=l 

where [i is a fixed non-negative constant, /5j is the vector of length n containing the value 
of the i th probe for each of the n individuals and by convention, 0q = 0. The choice 
of L2 norm in the constraint has the effect of creating long runs of consecutive vectors 
0i that are equal, interspersed with occasional jumps from one "constant" vector to the 
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next [27]. This mirrors the one profile case with its piecewise constant approximation, 
but here it is n profiles jointly. Intuitively, places where the set of profiles jointly "break" 
tend to be places where a large number of individual profiles exhibit a break. 

We now introduce a practical framework for solving such a problem. As for the one 
profile case [7], we make the change of variable ui = f3\, 112 = (h — fill ■ ■ ■ , u p = f3 p — f3 p -i, 
where all these objects are now n-dimensional vectors. This gives us the representation: 
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||Au — Y\\\ subject to \\ u i\\2 < 



where u is the np-dimensional vector of the p vectors Ui of length n stacked on top 
of each other, by momentary abuse of notation Y is the np-dimensional vector of the 
columns of Y stacked on top of each other and A is now as follows: for each 1 in the 
matrix A of the one profile case, replace it with an n x n identity matrix and for each 0, 
replace it with annxn matrix of zeros. The matrix A thus becomes an np x np matrix. 
In fact, this can be rewritten as a group Lasso [27], i.e., 
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Am - y 



i=l 



subject to \\ u i\U < M) (3) 



where Ai is the matrix of size np x n of the columns n(i — 1) + 1 up to ni of A, and Ui 
the z th column of u. Here, each group i is the set of n variables, one from each profile, 
found in position i on the genome. 

The group Lasso and group LARS algorithms [27] select variables "in groups" rather 
than one by one as for Lasso and LARS. This is useful when we have prior knowledge as 
to relationships between subsets of variables. In our case, the relationship is that each 
group represents the changes in value of n profiles between two adjacent locations on 
the genome. Selection of a group by the algorithm corresponds to choosing a location 
on the genome where a significant amount of change happens to a significant number 
of profiles. Whereas the standard algorithms for solving Lasso and LARS are almost 
identical, generalizations to group Lasso and group LARS are less so. The group LARS 
algorithm we describe here is therefore not the solution path to §3$), but to a similar 
problem. We chose to work with group LARS as similar ideas to those in [7] could be 
used to implement an extremely fast algorithm. 



2.3 "Many profiles, one chromosome" fast group LARS 

The group LARS algorithm we propose explicitly follows the steps given in [27], but 
avoids the suggested matrix formulation by generalizing the methodology of [7] . To find 
breakpoints, we follow the following steps, which have computational complexity 0(np), 
resulting in a complexity 0(npk) to find the first k breakpoints. 

• Step 1: start with u nxp = 0, k = 1 and r nxp = Y. 

• Step 2: compute the currently most correlated set: 

A\ = argmax [[^rill/pi. 

i 

By abuse of notation, r has been momentarily written as an np-dimensional vector 
of the columns of r stacked on top of each other. As Vi,j, pi = Pj = n here, we 
can ignore the division by pi. Due to the structure of A, to calculate the most 
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correlated set it suffices to calculate the cumulative sums by row of r nxp , starting 
in the p th column and moving back along the rows to the 1 st column. We call this 
resulting n x p matrix C. The column of C with the largest Li norm corresponds 
to the first selected group, i.e., breakpoint. This step takes 0(np) operations, and 
stores a n x p matrix. 

• Step 3: compute the current direction 7. We represent 7 as an n x p matrix. Each 
column of 7 whose index is not in the active set A k is identically zero. Calculating 
the rest of 7 usually requires having to calculate G 7 ^ = (iVi^) -1 , the inverse 
of an nk x nk matrix. In fact, due to the analogous structure of A with A from 
the one profile case in [7], it suffices to calculate the inverse Gj^ k = G^ fc of the 
relevant k x k matrix in the one profile case, which is a tridiagonal matrix with a 
closed form expression [7]. Then, with the generic notation B[:,i] meaning the i th 
column (s) of any matrix B, the remaining non-zero columns of 7 are obtained in 
O(nk) by the matrix multiplication: 

7 Mk] = C[:,A k ]xG' Ak . 

• Step 4 : for a ll i ^ Ak, compute how far the group LARS algorithm will progress 
in direction 7 before A{ enters the most correlated set. This can be measured by 
an Qj G [0, 1] such that 

||4(r- 0^7)11! = ||4,(r-a^ 7 )lli, 

where i' is arbitrarily chosen from Ak and r and 7 are again momentarily given as 
rap-dimensional vectors for notational simplicity. In [27], it was shown that aj is 
always well-defined. This equation is quadratic in aj. To solve it simultaneously 
for all % in Oinp): 

1 . define 7* as the nx p matrix we get by taking the row- wise cumulative sums 
of 7, starting this time at column 1 and finishing at column p. 

2. define 5* as the n x p matrix we get by taking the row- wise cumulative sums 
of 7*, starting at column p and finishing at column 1. 

3. the set of p quadratic equations is symbolically aa 2 + ba + c, with a = 
(ai, . . . , dp). Let Ak{j) mean the j th index in the active set, when sorted into 
ascending order. Then: 

4. to calculate a, first let a* = 5*. x S* - (5*[:, Ak(l)}- x <5*[:,.A fe (l)]) x l\ xp , 
where 'x' means matrix multiplication and '.x' means element by element 
multiplication, a is then the vector of length p given by the column-wise sums 
of a*. 

5. to calculate b, let b* = C. x 5* - (C[:, A k (l)}. x 5*[:, A k (l)}) x l lxp . Then, b 
is —2 times the column-wise sums of b* . 

6. to calculate c, let c* = C. x C - (C[:, Ak(l)]- x C[:,Ak(l)]) x l lxp . Then, c is 
the column- wise sums of c* . 

7. the quadratic equation can then be solved. 

• Step 5: select the smallest non- negative aj such that j Ak, then update the 
active set to include this new member: Ak+i = Ak U j. 

• Step 6: perform in 0(np) the updates: u = u + ajj, r = Y — A'y (with 7 momen- 
tarily written as an np-dimensional vector), k = k + 1 and update C with respect 
to the new r as was done in Step 2. Then go back to Step 3 and repeat until either 
p iterations are performed or a chosen number K max < p of breakpoints are found. 
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Figure 1: Speed trials, (a) CPU time for finding 50 breakpoints when there are 2000 probes and 
the number of profiles varies from 1 to 20. (b) CPU time when finding 50 breakpoints with the number 
of profiles fixed at 20 and the number of probes varying from 1000 to 10000 in intervals of 1000. (c) 
CPU time for 20 profiles and 2000 probes when selecting from 1 to 50 breakpoints. 

2.4 Many profiles, many chromosomes 

CGH profiles usually span many chromosomes. Constraints linking the last data point 
on each chromosome to the first data point on the next are meaningless. Removing these 
constraints means that the matrix A must be changed in the following way: replace Is 
with 0s whenever they corresponds to entries of the matrix indexed by two different 
chromosomes. This has minor follow-on effects in the algorithm, including calculation of 
cumulative sums and the matrix G^ k , and the updating of r. The speed of the algorithm 
is unaffected. 

3 Experiments 
3.1 Speed trials 

All trials used Matlab on a 2008 Macbook Pro with 4GB of RAM. Fig. QJa) indicates 
linearity in n, and 50 breakpoints were found in 0.72 seconds, an average of 0.014 seconds 
each. Fig. HJb) shows linearity in p. Fig. [TJc) shows, for n and p fixed, a near-linear 
relationship in k, i.e., subsequent breakpoints do not take longer to find than earlier 
ones. This confirms the theoretical 0(npk) complexity. 

To test the speed of the algorithm at the current limit of aCGH technol- 
ogy, we performed speed trials using sample data freely provided by Affymetrix 
(http://www.affymetrix.com) for the Affymetrix Genome- Wide Human SNP Array 6.0, 
which includes around 946 000 copy number probes. The dataset includes 5 sets of 5 
replicates. Three of the five individuals have abnormal copies of the X chromosome, 
with 3, 4 and 5 copies respectively. We calculated the log-ratio of 20 profiles (from 5 
replicates on 4 individuals) against the first profile of the first (normal) individual after 
removing Y chromosome probes, leaving 937 223 probes per profile. 

Again, the algorithm ran linearly in k. For 10 profiles, the group LARS algorithm 
took 3.4 seconds to find each subsequent joint breakpoint, and for 20 profiles, 6.1 seconds. 
Thus, the algorithm is computationally practical, extremely fast even, at the upper 
limits of current technology. We remark that for these 20 profiles, the fast group LARS 
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Figure 2: Simulating an aCGH profile with different levels of white noise, (a) 

is the underlying piecewise- const ant profile. White noise from a A/"(0, a 2 ) is added to the probes of (a) 
with: (b) a 2 = 0.01, (c) a 2 = 0.1, (d) a 2 = 0.2 and (e) a 2 = 0.5. 

algorithm correctly selected as the most important joint breakpoint (out of the 937 223 
possible choices) the jump from to the first probe on the X chromosome. 

3.2 Performance on simulated data 

We performed a series of simulations in order to verify that the algorithm behaved well 
with respect to our stated goals, namely, recover breakpoints shared by several of a set 
of profiles. We designed four experiments to move gradually from an artificial to a more 
realistic setting. 

1. All profiles share the same breakpoints. We simulated profiles of length 1000, each 
divided into ten constant-valued segments of length 100. Hence, there are 10 
breakpoints, located before probes 1, 101, 201, . . . , 901. The constant value of each 
of the 10 segments was randomly drawn from a uniform distribution on [—1,1]. 
White noise from a M(0, a 2 ) was then added independently to each of the 1000 
probe values. As shown in Fig. [21 we simulated with varying levels of noise: 
a 2 G {0.01,0.1,0.2,0.5}. In particular, we see that with a 2 = 0.5, the noise is 
often significantly larger than the distance between subsequent underlying constant 
segments. For a given value of a 2 , we randomly generated one profile in this way. 
We then asked the fast group LARS algorithm to find 10 breakpoints only. If 
these ten breakpoints corresponded exactly to the ten real breakpoints, we stopped. 
Otherwise, we added a second randomly generated profile and ran the group LARS 
jointly on these two, and so on. For each of 1000 such trials, we calculated the 
number of profiles needed to find the 10 real breakpoints as its first 10 predictions. 
A good algorithm should correctly select these breakpoints, given enough profiles. 

2. All profiles share the same breakpoint 'regions', though the breakpoints are not all 
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located at exactly the same probe on each profile. This is the same experimental 
condition as (1) except that each breakpoint that was fixed at location % can now be 
at any of {i — 2, i — 1, i, % + 1, i+2}, chosen uniformly. For simplicity, the breakpoint 
before probe 1 was kept fixed. For a given a 2 , we then run the algorithm as in 
(1), though in each loop we initially find the first 50 ordered breakpoints. We stop 
adding profiles one by one when the following event happens: every one of the 
first m > 10 ordered predicted breakpoints is included in one of the 10 breakpoint 
zones and there is at least one predicted breakpoint in each of the 10 zones, i.e., 
we find all of the 10 zones before adding a non-existant breakpoint. 

3. All profiles have a subset of a predefined set of breakpoints. Potential breakpoints 
are located in the 10 locations described in (1). First, the breakpoint before the 
first probe is automatically generated, i.e., the value of the constant segment from 
probe 1 to 100. Then, the potential breakpoint 2 before probe 101 is randomly 
included with probability 0.7. If it is included, we randomly select the value of the 
segment from 101 to 200 as in (1). Otherwise, the value of the segment from 1 to 
100 is continued from 101 to 200. We iterate this method up to the 10 th possible 
breakpoint. Thus, on average there are 7.3 breakpoints per profile (the first is 
always chosen and the 9 others chosen independently with probability 0.7). This 
is closer to what we see in reality, with aCGH profiles of patients with the same 
disease state sharing certain key breakpoints, yet but not all. 

We then proceed as in (1) for 1000 trials. One minor detail: if the first few 
randomly generated profiles only exhibit a subset of size s < 10 of the 10 possible 
breakpoints and the group LARS algorithm finds these s breakpoints before any 
others, we stop the trial at this point, as we cannot expect all 10 breakpoints to 
be found if some of them have not yet been randomly exhibited. 

4. All profiles have a subset of a predefined set of breakpoints though the exact location 
of each breakpoint can vary slightly between profiles. This experiment is a direct 
combination of (2) and (3). Again, 1000 trials were performed for each level of 
noise. The minor detail mentioned in (3) is treated in the same way here. 

Simulation results from experiments 1-4 are shown in Figures [30 The main re- 
sult is that, given enough profiles, the algorithm correctly selected the 10 breakpoint 
locations/regions for every experimental condition and every noise level. Specifically, 
in lower noise conditions (a 2 = 0.01, 0.1 and 0.2), rarely are more than fifty profiles 
needed to correctly select the true breakpoints. In more realistic conditions (a 2 = 0.5), 
with high probability, up to 75-200 profiles were necessary to correctly select the whole 
set of true breakpoints, though often much fewer were required. 

3.3 Application to bladder tumor CGH profiles 

We considered a publicly available aCGH data set of 57 bladder tumor samples [21]. 
Each aCGH profile gave the relative quantity of DNA for 2215 probes. We removed the 
probes corresponding to sexual chromosomes, because the sex mismatch between some 
patients and the reference used made the computation of copy number less reliable, 
giving us a final list of 2143 probes. 

Fig. EJa) shows the result of superimposing the smoothed versions of the 57 bladder 
tumor aCGH profiles, when the algorithm has selected 80 ranked common breakpoints. 
Figs 03(h) and (c) show 2 of the original 57 profiles and their associated smoothed version, 
where (b) was a profile exhibiting much instability, and (c) only on chromosome 9. We 
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Figure 3: Simulation conditions 1 (left) and 2 (right). Left: histograms of the number 
of profiles required to correctly predict 10 real breakpoints with no mistakes in the presence of white 
noise. Right: histograms of the number of profiles required to correctly predict all real breakpoints when 
each profile exhibits a breakpoint in each of 10 tightly defined regions, in the presence of white noise. 
The noise is Af(0,a 2 ) with (a) a 2 = 0.01, (b) a 2 = 0.1, (c) a 2 = 0.2 and (d) a 2 = 0.5. Each experiment 
was performed 1000 times. 
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Figure 4: Simulation conditions 3 (left) and 4 (right). Left: histograms of the number 
of profiles required to correctly predict all real breakpoints when each profile exhibits a subset of a 
predefined set of 10 breakpoints, in the presence of white noise. Right: histograms of the number of 
profiles required to correctly predict all real breakpoints when each profile exhibits a breakpoint in a 
subset of a predefined set of 10 tightly defined regions, in the presence of white noise. The noise is 
A/"(0,cr 2 ) with (a) a 2 = 0.01, (b) a 2 = 0.1, (c) a 2 = 0.2 and (d) a 2 = 0.5. Each experiment was 
performed 1000 times. 
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remark that even though (c) was forced to have the same breakpoints as (b), this does 
not translate into a poor smoothed version of (c), rather, the forced breakpoints are tiny 
jumps that can be ignored by biologists. 

Fig. 0(a) confirms nearly all of the duplications and deletions associated with bladder 
cancer found in [1,10,11]: frequent duplication of 8q22-24, 17q21 and 20q is observed, 
and frequent deletion of 8p22-23, 13q, 17p, lip and all of chromosome 9. The two 
known duplications that could not be confirmed here were 12ql4-15 and llql3. Fig. 
EJ^a) suggests other potentially important CNVs, including frequent duplication of lq, 
5p and deletion of 4q and lOq. 

4 Discussion 

Segmentation of a single aCGH profile into regions of constant copy number, separated 
by breakpoints, is a well-studied problem [5,8,9,13,15,16,23,25]. Recently, attempts 
have been made to deal simultaneously with many profiles [3, 12, 14, 18, 19]. In these 
methods, the search for important shared CNV regions tends to occur as the final step, 
either by choice of a level of significance [3] or in post-processing [19]. These methods 
do not use the biological prior information that different profiles are likely to share at 
least some breakpoints. Also, they may require the choice of pertinent kernel windows 
[12], lose information through discretization [3,19] or involve prohibitive computational 
complexity for large p [14]. 

To our knowledge, we have introduced for the first time a way to explicitly code the 
prior biological information of expecting patients with the same disease to share certain 
CNVs. Our method forces breakpoints to be located in the same places for all profiles. 
This has the effect of selecting breakpoint locations where many, but not necessarily all, 
profiles exhibit a breakpoint. This corresponds exactly to one of the underlying biological 
goals in CNV studies. As shown in Fig. Etc), it is important to note that a profile forced 
to have breakpoints where it clearly does not, still ends up with a good quality smoothed 
representation. We also showed that superimposing all smoothed versions on one graph 
allows intuitive visual interpretation of the data in a lower-dimensional form. On a real 
bladder cancer data set [21], the smoothed versions we obtained (Fig. [5ja)) confirmed 
nearly all of the known CNVs described in the articles [1, 10, 11]. Furthermore, our 
proposed algorithm is extremely fast. Even at the limit of current aCGH technology, it 
is practical, taking a few minutes on a single laptop computer. 

The piecewise-constant versions of the original profiles can be seen as extracted 
low-dimensional features. These can potentially be used to implement classification 
algorithms to discriminate between two or more classes of aCGH profile, e.g., different 
disease states. For example, each piecewise-constant profile could simply be treated as a 
vector of the constant values. As the breakpoints are forced to be in the same place on 
all profiles, the vector representation is the same size for each profile, directly opening 
the way for the use of many well-known classification methods. This is a promising 
research direction. 

The question of how many breakpoints to choose, i.e., when to 'stop' the algorithm, 
remains open. There are at least two possible solutions. First, if the algorithm were 
to be associated with a classification algorithm, a stopping criteria using internal cross- 
validation on the learning set can be defined. Second, it might be useful to calculate 
how much each subsequently selected breakpoint closes the distance between the set of 
smoothed and original profiles, and define a stopping criteria based on this. 
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Figure 5: Graphical representation, (a) superimposition of the smoothed versions of 57 
bladder tumor aCGH profiles [21] with 80 breakpoints. Vertical lines divide chromosomes f-22. (b) 
a profile exhibiting many CNVs, and its smoothed version, (c) a profile only showing a deletion on 
chromosome 9, and its smoothed version. Smoothed profiles are obtained by replacing the set of probe 
values between consecutive breakpoints with their mean value. 
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